Abstract. We introduce a stochastic approximation method for the solution of an ergodic Kullback-Leibler
control problem. A Kullback-Leibler control problem is a Markov decision process on a finite state space in
which the control cost is proportional to a Kullback-Leibler divergence of the controlled transition
probabilities with respect to the uncontrolled transition probabilities. The algorithm discussed in this work
allows for a sound theoretical analysis using the ODE method. In a numerical experiment the algorithm is shown
to be comparable to the power method and the related Z-learning algorithm in terms of convergence speed.
It may be used as the basis of a reinforcement learning style algorithm for Markov decision problems.



1. Introduction 

In reinforcement learning [3, 13] we are interested in making optimal decisions in an uncertain
environment.

Consider the setting where we are condemned to reside in a certain finite environment for an indefinite 
amount of time. Whenever we make a move in the environment from one state to another state, we 
incur a certain cost, depending on the transition. We cannot directly influence this incurred cost, but 
can hope to make transitions yielding a minimal average cost per transition. 

This is an example of a Markov decision process [13], and in this paper we present a method that
approximately solves this problem in a very general setting. The algorithm we present, KL-learning
(Algorithm 1), observes randomly made moves (according to some Markov chain transition probabilities)
and the costs we incur, and finds from this, at no significant computational cost, improved
transition probabilities for the Markov chain.

This is in contrast to some other well known reinforcement learning algorithms, in which at every
iteration an optimization over possible actions is necessary (e.g. Q-learning, [15]) or in which an
optimization step is necessary to compute optimal actions (e.g. TD-learning, [12]).

The background for this method is the setting of the Kullback-Leibler (KL) control problem, introduced
in [14]. A KL-control problem is a Markov decision process in which the control costs are proportional
to a Kullback-Leibler divergence or relative entropy. In [14] a reinforcement learning style algorithm
(Z-learning) was also presented, which operates under the assumption of there being an absorbing state in
which no further costs are incurred. This assumption is not made in our algorithm; instead we assume
ergodicity of the underlying Markov chain. Arguably this yields a more general setting, in which a hard
reset of the algorithm is never necessary. KL control problems may also be solved using techniques
from graphical model inference [4].

As a preliminary, we introduce the KL-control setting in Section 2. In Section 3 the KL-learning
algorithm is presented and motivated on a heuristic level. We then describe the ODE method [7, 8, 10] in
Section 4, along with an application to a stochastic gradient algorithm and Z-learning [14] as illustrative
examples. We then apply the ODE method to KL-learning in Section 5. A numerical example is provided
in Section 6, after which a short discussion follows in Section 7.

2. Kullback-Leibler control problems

In this section we introduce a particular class of Markov decision processes that admit a particularly
convenient solution. We will refer to these problems as Kullback-Leibler control problems. For a more
detailed introduction, see [14].

Let $t = 0, 1, 2, \ldots$ denote time. Consider a Markov chain $(X_t)_{t \ge 0}$ on a finite state space
$S = \{1, \ldots, n\}$ with transition probabilities $q = (q(j|i))$, which we call the uncontrolled dynamics.
We will make no distinction between the notation $q_{ij}$ and $q(j|i)$, where $q(j|i) = q_{ij}$ denotes the
probability of jumping from state $i$ to state $j$; the notation $q_{ij}$ will be more convenient when
working with matrices.

Suppose for every jump of the Markov chain from state $i$ to state $j$ in $S$ a transition dependent cost
$c(j|i)$ is incurred. Sometimes we will use the notation $c(i)$ to denote costs depending only on the state, i.e.
$c(j|i) = c(i)$ for all $i, j = 1, \ldots, n$. A state $i$ is called absorbing if $q(i|i) = 1$ and $c(i) = 0$.

We wish to change the transition probabilities in such a way as to minimize, for example, the total
incurred cost (assuming there exist absorbing states where no further costs are incurred) or the average
cost per stage. For deviating from the uncontrolled transition probabilities, control costs are incurred equal to

\[
  \frac{1}{\beta} \sum_{j=1}^n p(X_{t+1} = j \mid X_t) \ln \frac{p(X_{t+1} = j \mid X_t)}{q(X_{t+1} = j \mid X_t)}
\]

at every time step, in addition to the cost per transition $c(X_t|X_{t-1})$, where $\beta > 0$ is a weighting factor,
indicating the relative importance of the control costs.

To put this problem in the usual form of a discrete time stochastic optimal control problem, we write
$p_{ij} = \exp(u_j(i)) q_{ij}$. This guarantees positive probabilities and absolute continuity of the controlled
dynamics with respect to the uncontrolled dynamics. In the case of an infinite horizon problem and
minimization of a total expected cost, the corresponding Bellman equation for the value
function $\Phi$ is

\[
  \Phi(i) = \min_{(u_1, \ldots, u_n) \in \mathbb{R}^n} \sum_{j=1}^n q(j|i) \exp(u_j) \left( c(j|i) + u_j/\beta + \Phi(j) \right),
\]

where the minimization is over all $u_1, \ldots, u_n$ such that $\sum_{j=1}^n \exp(u_j) q(j|i) = 1$. If there are no
absorbing states, the total cost will always be infinite and the expression above has no meaning. We may then
instead aim to minimize the expected average cost. For an average cost problem, the Bellman equation
for the value function $\Phi$ is

\[
  \rho + \Phi(i) = \min_{(u_1, \ldots, u_n) \in \mathbb{R}^n} \sum_{j=1}^n q(j|i) \exp(u_j) \left( c(j|i) + u_j/\beta + \Phi(j) \right),
  \tag{1}
\]

where again the minimization is over all $u_1, \ldots, u_n$ such that $\sum_{j=1}^n \exp(u_j) q_{ij} = 1$, and where
$\rho$ is the optimal average cost. In the average cost case we restrict the possible solutions by requiring that

\[
  \sum_{i=1}^n \exp(-\beta \Phi(i)) = 1;
  \tag{2}
\]

otherwise any addition by a scalar would result in another possible value function. The reason for the
particular form of this restriction will become clear later.

Note that in case the total expected cost problem has a finite value function, the average
cost problem (1) would have a solution with $\rho = 0$. This shows that in a sense the average cost problem
is more general, since then (1) remains valid for the total expected cost problem. Therefore we will
henceforth only consider the average cost problem.

So far the derivations have been standard; see [2] for more information on dynamic programming and
the Bellman equation.

It is remarkable that a straightforward computation using Lagrange multipliers, as in [14], yields that
the optimal $u_j^*(i)$ and value function $\Phi$ solving (1) are given by the simple expressions

\[
  u_j^*(i) = \ln\left( \frac{z_j^*}{\lambda^* z_i^*} \right) - \beta c(j|i), \qquad \Phi(i) = -\frac{1}{\beta} \ln(z_i^*),
  \tag{3}
\]

with $z^* \in \mathbb{R}^n$ given implicitly by

\[
  \lambda^* z_i^* = \sum_{j=1}^n \exp(-\beta c(j|i)) q(j|i) z_j^*,
\]

which may be written as $\lambda^* z^* = H z^*$, with

\[
  H = [h_{ij}] \quad \text{with entries} \quad h_{ij} = \exp(-\beta c(j|i)) q(j|i)
  \tag{4}
\]

and where $\lambda^* = \exp(-\beta \rho^*)$. This $z^*$ should be normalized in such a way that the value function agrees
with the value in the absorbing states for a total expected cost problem, or with the normalization (2)
in the average cost case, which is chosen in such a way that it corresponds to $\|z^*\|_1 = \sum_{i=1}^n z_i^* = 1$. The
optimal transition probabilities simplify to

\[
  p(j|i)^* = q(j|i) \exp(-\beta c(j|i)) \frac{z_j^*}{\lambda^* z_i^*}.
\]

According to the Perron-Frobenius theory of nonnegative matrices (see [5]), if the uncontrolled Markov
chain $q$ is irreducible then there exists, by Observation 2.1 below, a simple eigenvalue $\lambda^*$ of $H$ equal to
the spectral radius $\rho(H)$, with an eigenvector $z^*$ which has only positive entries. Since $\lambda^*$ is a simple
eigenvalue, $z^*$ is unique up to multiplication by a positive scalar. These $\lambda^*$ and $z^*$ (with $z^*$ normalized
as above) are called the Perron-Frobenius eigenvalue and eigenvector, respectively. The optimal average
cost is given by $\rho^* = -\frac{1}{\beta} \ln \lambda^*$. In case of a total expected cost problem, where $\rho^* = 0$, it follows that
$\lambda^* = 1$, which may also be shown directly by analysis of the matrix $H$.
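
The Lagrange multiplier computation behind (3) can be sketched as follows. This is a reconstruction in the notation of this section, not a quotation of the derivation in [14]; the multiplier $\mu$ and the abbreviation $z_j := \exp(-\beta \Phi(j))$ are introduced here for illustration only.

```latex
\begin{align*}
L(u, \mu) &= \sum_{j=1}^n q(j|i)\, e^{u_j} \Bigl( c(j|i) + \tfrac{u_j}{\beta} + \Phi(j) \Bigr)
            + \mu \Bigl( \sum_{j=1}^n q(j|i)\, e^{u_j} - 1 \Bigr), \\
0 = \frac{\partial L}{\partial u_j}
  &= q(j|i)\, e^{u_j} \Bigl( c(j|i) + \tfrac{u_j}{\beta} + \Phi(j) + \tfrac{1}{\beta} + \mu \Bigr)
  \;\Longrightarrow\; e^{u_j} \propto e^{-\beta c(j|i)} z_j, \\
\sum_{j=1}^n q(j|i)\, e^{u_j} = 1
  &\;\Longrightarrow\; e^{u_j} = \frac{e^{-\beta c(j|i)} z_j}{(Hz)_i}, \\
\rho + \Phi(i) &= -\tfrac{1}{\beta} \ln (Hz)_i
  \;\Longrightarrow\; e^{-\beta \rho} z_i = (Hz)_i .
\end{align*}
```

The last line is the eigenvalue equation $\lambda z = Hz$ with $\lambda = \exp(-\beta \rho)$, and substituting $(Hz)_i = \lambda z_i$ into the third line recovers the expression for $u_j^*(i)$ in (3).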

Recall that a nonnegative matrix $A$ is called irreducible if for every pair $i, j \in S$, there exists an $m \in \mathbb{N}$
such that $(A^m)_{ij} > 0$. In particular, a Markov chain $p$ is called irreducible if the above property holds for
its transition matrix.

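2.1. Observation. Suppose the finite Markov chain $q$ on $S = \{1, \ldots, n\}$ with transition probabilities $q(j|i)$
is irreducible. Then $H$ as given by (4) is irreducible. In particular, there exists a unique (modulo scalar
multiples) positive eigenvector $z^* \in \mathbb{R}^n$ of $H$ such that $H z^* = \lambda^* z^*$, where
$\lambda^* = \rho(H) = \sup_{\mu \in \sigma(H)} |\mu|$, the spectral radius of $H$.

Proof. Let $\gamma = \min_{k, l \in S} e^{-\beta c(l|k)}$. Let $i, j \in S$ and pick $m \in \mathbb{N}$ such that $(q^m)_{ij} > 0$. Then

\[
  (H^m)_{ij} = \sum_{k_1=1}^n \cdots \sum_{k_{m-1}=1}^n h_{i k_1} h_{k_1 k_2} \cdots h_{k_{m-1} j}
  \ge \gamma^m \sum_{k_1=1}^n \cdots \sum_{k_{m-1}=1}^n q_{i k_1} q_{k_1 k_2} \cdots q_{k_{m-1} j}
  = \gamma^m (q^m)_{ij} > 0.
\]

Hence $H$ is irreducible. The existence and uniqueness of the eigenvalue and corresponding eigenvector is then an immediate
corollary of the Perron-Frobenius theorem [5, Theorem 8.4.4]. □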

Recall that a Markov chain $[p_{ij}]$ is said to satisfy detailed balance if there exists a probability distribution
$(p_i)$ such that $p_i p_{ij} = p_j p_{ji}$ for all $i, j$. In this case $(p_i)$ is an invariant probability distribution for the
Markov chain.

2.2. Proposition. Suppose the uncontrolled dynamics $q$ satisfy detailed balance (with respect to the
invariant probability distribution given by $(q_i)$).

(a) If the transition costs are actually state costs, i.e. $c(j|i) = c(i)$ for $i, j = 1, \ldots, n$, then the optimal
controlled dynamics satisfy detailed balance with invariant probability distribution given by

\[
  p_i \propto q_i \exp(\beta c(i)) (z_i^*)^2, \qquad i = 1, \ldots, n.
\]

(b) If the transition costs are symmetric, i.e. $c(j|i) = c(i|j)$ for $i, j = 1, \ldots, n$, then the optimal
controlled dynamics satisfy detailed balance with invariant probability distribution given by

\[
  p_i \propto q_i (z_i^*)^2, \qquad i = 1, \ldots, n.
\]

Proof. We will prove (a); the proof of (b) is analogous.

Using that $p_{ij} = \exp(u_j^*(i)) q_{ij}$ with $u_j^*(i)$ given by (3), we verify that $p_i p_{ij} = p_j p_{ji}$ for all $i, j$. Indeed,

\[
  p_i p_{ij} = q_i \exp(\beta c(i)) (z_i^*)^2 \, q_{ij} \frac{z_j^*}{\lambda^* z_i^*} \exp(-\beta c(i)) / Z
  = q_i q_{ij} z_i^* z_j^* / (\lambda^* Z)
\]
\[
  = q_j q_{ji} z_j^* z_i^* / (\lambda^* Z)
  = q_j \exp(\beta c(j)) (z_j^*)^2 \, q_{ji} \frac{z_i^*}{\lambda^* z_j^*} \exp(-\beta c(j)) / Z = p_j p_{ji},
\]

where $Z = \sum_{k=1}^n q_k \exp(\beta c(k)) (z_k^*)^2$ is a normalization constant. □

2.3. Example: solution in case of trivial detailed balance. If we take as uncontrolled dynamics $q_{ij} = q_j$,
where $(q_j)$ is a probability distribution on $\{1, \ldots, n\}$, then $H_{ij} = \exp(-\beta c(i)) q_j$ is of rank one and has
non-zero eigenvalue $\lambda^* = \sum_{j=1}^n q_j \exp(-\beta c(j))$ with eigenvector $z^*$ given by $z_i^* = \exp(-\beta c(i))$. The
optimal transition probabilities are given by

\[
  p_{ij} = q_j \exp(-\beta c(j)) \Big/ \sum_{k=1}^n q_k \exp(-\beta c(k)),
\]

which again are independent of $i$. Therefore the Markov chain given by the controlled dynamics has
invariant probability distribution $[p_j] = [p_{ij}]$. The optimal average cost is given by

\[
  \rho^* = -\frac{1}{\beta} \ln \lambda^* = -\frac{1}{\beta} \ln \sum_{k=1}^n q_k \exp(-\beta c(k)).
\]
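
The closed-form expressions of this example can be checked numerically. The following is a minimal sketch; the three-state data (`q`, `c`, `beta`) and the function name `rank_one_solution` are hypothetical choices made for illustration and do not come from the source.

```python
import math

def rank_one_solution(q, c, beta):
    """Closed-form quantities of Example 2.3 for q(j|i) = q[j] and state costs c[i]."""
    lam = sum(qk * math.exp(-beta * ck) for qk, ck in zip(q, c))   # lambda*
    z = [math.exp(-beta * ci) for ci in c]                          # eigenvector (not normalized)
    p = [qj * math.exp(-beta * cj) / lam for qj, cj in zip(q, c)]   # optimal row, the same for every i
    rho = -math.log(lam) / beta                                     # optimal average cost
    return lam, z, p, rho

# Hypothetical data: three states, beta = 1.
q = [0.5, 0.3, 0.2]
c = [1.0, 0.0, 2.0]
beta = 1.0
lam, z, p, rho = rank_one_solution(q, c, beta)

# Check the eigenvalue equation H z = lambda z with h_ij = exp(-beta c(i)) q_j.
n = len(q)
H = [[math.exp(-beta * c[i]) * q[j] for j in range(n)] for i in range(n)]
Hz = [sum(H[i][j] * z[j] for j in range(n)) for i in range(n)]
residual = max(abs(Hz[i] - lam * z[i]) for i in range(n))
```

In this sketch the controlled chain shifts probability mass towards the zero-cost state, which is what the formula for $p_{ij}$ suggests.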



3. KL-LEARNING 

As explained in the previous section, a Kullback-Leibler control problem may be solved by finding the
Perron-Frobenius eigenvalue $\lambda^*$ and eigenvector $z^*$ of the matrix $H$ given by (4).

A straightforward way to find $\lambda^*$ and $z^*$ is using the power method, i.e. by performing the iteration

\[
  z_{k+1} = \frac{H z_k}{\|H z_k\|}.
  \tag{5}
\]

This assumes that we have access to the full matrix $H$. Our goal is to relax this assumption, and to find
$z^*$ by iteratively stepping through states of the Markov chain using the uncontrolled dynamics $q$, using
only the observations of the cost $c(j|i)$ when we make a transition from state $i$ to state $j$.
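
For reference, the power method (5) can be sketched in a few lines when the full matrix $H$ is available. This is an illustrative sketch only: the 1-norm is assumed for the normalization (matching $\|z^*\|_1 = 1$), and the three-state data `q`, `c`, `beta` are hypothetical.

```python
import math

def power_method(H, iters=500):
    """Iterate z <- Hz / ||Hz||_1 from the uniform vector; return (lambda, z)."""
    n = len(H)
    z = [1.0 / n] * n
    lam = 0.0
    for _ in range(iters):
        Hz = [sum(H[i][j] * z[j] for j in range(n)) for i in range(n)]
        lam = sum(Hz)                 # equals ||Hz||_1 because all entries are nonnegative
        z = [v / lam for v in Hz]
    return lam, z

# Hypothetical example: q[i][j] = q(j|i), c[i][j] = c(j|i), beta = 1.
q = [[0.2, 0.5, 0.3],
     [0.4, 0.2, 0.4],
     [0.3, 0.3, 0.4]]
c = [[0.0, 1.0, 2.0],
     [1.0, 0.0, 1.0],
     [2.0, 1.0, 0.0]]
beta = 1.0
n = len(q)
H = [[math.exp(-beta * c[i][j]) * q[i][j] for j in range(n)] for i in range(n)]

lam, z = power_method(H)
rho = -math.log(lam) / beta           # optimal average cost
# Optimal transition probabilities p(j|i) = h_ij z_j / (lambda z_i).
p = [[H[i][j] * z[j] / (lam * z[i]) for j in range(n)] for i in range(n)]
```

Because $z$ is kept at unit 1-norm, the normalizing constant $\|Hz_k\|_1$ itself serves as the running estimate of $\lambda^*$.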

In [14] a stochastic approximation algorithm (see [1, 3, 7, 8]), referred to as Z-learning, is introduced for
the case $\lambda^* = 1$. We will extend this method here to the case where $\lambda^*$ is a priori unknown.

In this section we will denote vectors by bold letters, e.g. $\mathbf{v}$. Components of this vector will be denoted
as $v(i)$ or $v_i$. The notation $\mathbb{1}$ is used for the column vector containing only ones. A vector $\mathbf{v} \in \mathbb{R}^n$ is said
to be nonnegative ($\mathbf{v} \ge 0$) if $v(i) \ge 0$ for all $i = 1, \ldots, n$ and positive ($\mathbf{v} > 0$) if $v(i) > 0$ for all $i = 1, \ldots, n$.

The algorithm we will consider is Algorithm 1. The parameter $M \in \mathbb{N}$ denotes the number of iterations
of the algorithm, and $\gamma > 0$ indicates the stepsize. We assume that the Markov transition probabilities
$q(\cdot|\cdot)$ are irreducible and aperiodic, and hence ergodic.

At every iteration, we make a random jump to a new state. Based on our observation of the incurred cost
at the previous step, and the current values of $\lambda$ and two components of $\mathbf{z}$, a number $\Delta$ is computed that
determines how much $\mathbf{z}$ and $\lambda$ should be changed. The value of $\lambda$ is always equal to $\sum_{i=1}^n z_i = \|\mathbf{z}\|_1$. Note that
every step of the iteration consists of only simple algebraic operations and hence has time complexity
$O(1)$. In particular, no optimization is needed, as opposed to e.g. Q-learning [15].

Algorithm 1 KL-learning

  $\mathbf{z} \leftarrow \frac{1}{n} \mathbb{1}$, $\lambda \leftarrow 1$, $x \leftarrow$ any state in $S$
  for $k = 1$ to $M$ do
    $y \leftarrow$ independent draw from $q(\cdot|x)$
    $\Delta \leftarrow \exp(-\beta c(y|x)) z(y)/\lambda - z(x)$
    $z(x) \leftarrow z(x) + \gamma \Delta$
    $\lambda \leftarrow \lambda + \gamma \Delta$
    $x \leftarrow y$
  end for
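
A direct transcription of Algorithm 1 into Python may help to see how little work each iteration requires. This is a minimal sketch under stated assumptions: the function name `kl_learning`, the seeded random generator, the initial state 0 and the example data are all hypothetical, and costs are only ever read for the transition that was actually observed.

```python
import math
import random

def kl_learning(q, c, beta, gamma, M, seed=0):
    """Sketch of Algorithm 1. q[i][j] = q(j|i), c[i][j] = c(j|i); returns (z, lam)."""
    rng = random.Random(seed)
    n = len(q)
    z = [1.0 / n] * n          # z <- (1/n) * ones, so that ||z||_1 = lam
    lam = 1.0
    x = 0                      # any initial state
    for _ in range(M):
        y = rng.choices(range(n), weights=q[x])[0]   # draw y from q(.|x)
        delta = math.exp(-beta * c[x][y]) * z[y] / lam - z[x]
        z[x] += gamma * delta
        lam += gamma * delta   # keeps lam equal to sum(z)
        x = y                  # continue the chain from the new state
    return z, lam

# Hypothetical example data.
q = [[0.2, 0.5, 0.3],
     [0.4, 0.2, 0.4],
     [0.3, 0.3, 0.4]]
c = [[0.0, 1.0, 2.0],
     [1.0, 0.0, 1.0],
     [2.0, 1.0, 0.0]]
z, lam = kl_learning(q, c, beta=1.0, gamma=0.01, M=20000)
```

With a constant step size the iterates are expected to fluctuate around the Perron-Frobenius pair rather than converge exactly; the only properties this sketch guarantees are the invariants $\lambda = \|\mathbf{z}\|_1$ and $\mathbf{z} > 0$ (for $\gamma < 1$).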



A theoretical analysis of (a slightly modified version of) this algorithm will be performed in Section 5.
The results of that section are summarized in Theorem 5.2. First we provide some intuition.

3.1. Heuristic motivation. Suppose at time $m$ we are in state $i$. The expected value of $\Delta$ is

\[
  \sum_{j=1}^n q_{ij} \left( \exp(-\beta c_{ij}) z(j)/\lambda - z(i) \right) = (H\mathbf{z})_i/\lambda - z_i.
\]

Since $\lambda = \|\mathbf{z}\|_1$, the expected update to $\mathbf{z}$ may be interpreted as

\[
  z_{\mathrm{new}}(i) = z(i) + \gamma \left[ (H\mathbf{z})(i)/\lambda - z(i) \right] = (1 - \gamma) z(i) + \gamma (H\mathbf{z})(i)/\|\mathbf{z}\|_1,
\]

a convex combination of the old value of $z(i)$ and the value $z(i)$ would obtain after an iteration of the
power method described above. The normalization is, however, based on the previous value of $\mathbf{z}$, but this
does not affect the convergence of the power method.

The frequency of updates to the $i$-th component of $\mathbf{z}$ depends, in the long run, on the equilibrium
distribution $(q_i)$ of the underlying Markov chain. This will be a major concern in the convergence
analysis of the algorithm. It will turn out that the convergence of the algorithm depends on the stability
properties of a certain matrix, $A$ say. If we wish the algorithm to converge for a certain invariant
distribution, this corresponds to the matrix $DA$ being stable, where $D$ is a diagonal matrix with the invariant
distribution on the diagonal. This will be made clear in Section 5.

4. Analysis of stochastic approximation algorithms through the ODE method 

In this section a general and powerful method for analyzing the behaviour and possible convergence of
stochastic approximation algorithms is described. It will be applied to Algorithm 1 in Section 5. This
method, called the ODE method, was first introduced by Ljung [10] and developed significantly by
Kushner and coworkers [7, 8]. Accounts that are well suited for computer scientists and engineers may
be found in [1, 3].

The theory is illustrated by applying it to some stochastic algorithms. The new contribution of this
section to the existing theory is the necessity of diagonal stability for the convergence of certain stochastic
algorithms, as discussed in Section 4.9.

4.1. Outline of the ODE method. The idea of the ODE method is to establish a relation between the 
trajectories of a stochastic algorithm with decreasing stepsize, and the trajectories of an ordinary dif- 
ferential equation. If all trajectories of the ODE converge to a certain equilibrium point, the same can 
then be said about trajectories of the stochastic algorithm. This is made more precise in the following 
theorem, which is a special case of [8, Theorem 6.6.1] tailored to our needs.

4.2. Hypotheses. Consider the general stochastic approximation algorithm given by Algorithm 2,
under the following assumptions and notation:

(i) Let $\gamma_1, \gamma_2, \ldots$ be a sequence of step sizes, satisfying $\sum_{k=1}^\infty \gamma_k = \infty$ and $\sum_{k=1}^\infty \gamma_k^2 < \infty$;

Here ODE is an abbreviation for ordinary differential equation.

Algorithm 2 General stochastic approximation algorithm for theoretical analysis

  $x_0 \leftarrow$ any state in $S$
  $\theta_0 \leftarrow$ any initial vector in $\mathbb{R}^m$
  for $k = 1$ to $\infty$ do
    $x_k \leftarrow$ independent draw from $q(\cdot|x_{k-1})$
    $\theta_k \leftarrow \theta_{k-1} + \gamma_k f(\theta_{k-1}, x_{k-1}, x_k)$
  end for



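(ii) Let $q(\cdot|\cdot)$ be irreducible aperiodic Markov transition probabilities on a finite state space $S$ with
invariant probabilities $q_i$, $i \in S$;

(iii) Suppose that $\{\theta_k : k \in \mathbb{N}\} \subset K$ with probability one, where $K$ is some compact (i.e. closed and
bounded) subset of $\mathbb{R}^m$;

(iv) Suppose $(\theta, x, y) \mapsto f(\theta, x, y) : \mathbb{R}^m \times S \times S \to \mathbb{R}^m$ is continuous in $\theta$ for every $x, y \in S$;

(v) Define $g : \mathbb{R}^m \to \mathbb{R}^m$ by

\[
  g(\theta) := \sum_{i \in S} \sum_{j \in S} q_i q_{ij} f(\theta, i, j);
  \tag{6}
\]

(vi) Define $t_0 := 0$ and $t_k := \sum_{i=1}^k \gamma_i$ for $k \in \mathbb{N}$. Denote, for all $t \ge 0$ and $k = 0, 1, 2, \ldots$, $\theta^k(t) := \theta_p$ for
the unique $p$ such that $t_p \le t + t_k < t_{p+1}$, and $\theta^k(t) := \theta_0$ if $t + t_k < t_0$.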

These assumptions are sufficient for our purposes. The sequence $(\gamma_k)$ denotes the stepsize or gain. The
conditions under (i) on $(\gamma_k)$ are standard conditions to guarantee that the gain gradually decreases, but
not too quickly, in which case the algorithm would stop making significant updates before being able
to converge.
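
As a concrete illustration (an assumed example, not taken from the source), polynomially decaying step sizes satisfy both conditions in (i) exactly when the exponent lies in $(1/2, 1]$:

```latex
\gamma_k = k^{-a}: \qquad
\sum_{k=1}^{\infty} k^{-a} = \infty \iff a \le 1,
\qquad
\sum_{k=1}^{\infty} k^{-2a} < \infty \iff a > \tfrac{1}{2}.
```

So $\gamma_k = 1/k$ qualifies, whereas a constant step size violates the second condition and $\gamma_k = k^{-2}$ violates the first.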

In [8] more general classes of algorithms and assumptions are considered. 

4.3. Theorem (convergence of stochastic algorithms with state dependent updates). Suppose
Assumptions 4.2 hold. Then, with full probability,

(i) Every sequence in the collection of functions $\{\theta^k : k \in \mathbb{N}\}$ (as defined under Assumption 4.2 (vi))
admits a convergent subsequence with a continuous limit.

(ii) Let $\theta$ denote the limit of some converging subsequence in $\{\theta^k : k \in \mathbb{N}\}$ (which always exists by
(i)). Then $\theta$ satisfies the ODE

\[
  \dot{\theta}(t) = g(\theta(t)).
  \tag{7}
\]

(iii) If a set $A \subset \mathbb{R}^m$ is globally asymptotically stable with respect to the ODE (7), then $\theta_k \to A$, i.e.
$\min_{x \in A} |\theta_k - x| \to 0$.

Outline of proof. The proof consists of a verification of the conditions of [8, Theorem 6.6.1]. One key
ingredient for this verification is Lemma 4.4 below, which says that convergence of the pair $(X_{k-1}, X_k)$ to
its equilibrium distribution $(q_i q_{ij})_{i,j \in S}$ happens exponentially fast. □

Recall the total variation distance [9, Section 4.1] of two probability measures $\mu_1, \mu_2$ on a discrete space
$S$,

\[
  \|\mu_1 - \mu_2\|_{TV} := \sup_{A \subset S} |\mu_1(A) - \mu_2(A)| = \frac{1}{2} \sum_{i \in S} |\mu_1(i) - \mu_2(i)|.
\]

Here by convergence we mean uniform convergence on bounded intervals.

4.4. Lemma (Markov chain convergence to invariant distribution). Let $q(j|i)$, $i, j \in S$, denote the
transition probabilities of an irreducible, aperiodic Markov chain $X$ on a finite state space $S$ with
invariant distribution $q_i$, $i \in S$. Let $\mu_x^k$ be the probability measure on $S \times S$ denoting the distribution of
$(X_{k-1}, X_k)$ given $X_0 = x$. Let $\pi$ denote the probability measure on $S \times S$ given by $\pi(i, j) = q_i q(j|i)$.
Then there exist constants $\alpha \in (0, 1)$ and $C > 0$ such that

\[
  \max_{x \in S} \|\mu_x^k - \pi\|_{TV} \le C \alpha^k \quad \text{for all } k \in \mathbb{N}.
  \tag{8}
\]

Proof. Let $\nu_x^k$ denote the probability measure on $S$ denoting the distribution of $X_k$ given initial condition
$X_0 = x$. By [9, Theorem 4.9], there exist constants $\tilde{C} > 0$ and $\alpha \in (0, 1)$ such that

\[
  \max_{x \in S} \|\nu_x^{k-1} - q\|_{TV} \le \tilde{C} \alpha^{k-1} \quad \text{for } k \in \mathbb{N}.
\]

Therefore

\[
  \max_{x \in S} \|\mu_x^k - \pi\|_{TV}
  = \max_{x \in S} \frac{1}{2} \sum_{i \in S} \sum_{j \in S} \left| P(X_{k-1} = i, X_k = j \mid X_0 = x) - q_i q(j|i) \right|
\]
\[
  = \max_{x \in S} \frac{1}{2} \sum_{i \in S} \sum_{j \in S} q(j|i) \left| P(X_{k-1} = i \mid X_0 = x) - q_i \right|
\]
\[
  = \max_{x \in S} \frac{1}{2} \sum_{i \in S} \left| P(X_{k-1} = i \mid X_0 = x) - q_i \right|
  = \max_{x \in S} \|\nu_x^{k-1} - q\|_{TV} \le \tilde{C} \alpha^{k-1}.
\]

By letting $C = \tilde{C}/\alpha$ we find that (8) holds. □
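
The geometric decay in (8) can be observed numerically on a small chain. The sketch below computes $\max_x \|\mu_x^k - \pi\|_{TV}$ exactly by propagating the law of $X_{k-1}$; the three-state matrix `q` and all function names are hypothetical, and the invariant distribution is approximated by long iteration, which presumes the chain is ergodic.

```python
def step(mu, q):
    """One step of the chain: (mu q)(j) = sum_i mu(i) q(j|i)."""
    n = len(q)
    return [sum(mu[i] * q[i][j] for i in range(n)) for j in range(n)]

def tv(mu, nu):
    """Total variation distance between two distributions given as lists."""
    return 0.5 * sum(abs(a - b) for a, b in zip(mu, nu))

def pair_tv_distances(q, steps):
    """Return [max_x ||law(X_{k-1}, X_k | X_0 = x) - pi||_TV for k = 1..steps]."""
    n = len(q)
    inv = [1.0 / n] * n
    for _ in range(10000):               # approximate the invariant distribution
        inv = step(inv, q)
    pi = [inv[i] * q[i][j] for i in range(n) for j in range(n)]
    # laws[x] holds the law of X_{k-1} given X_0 = x, starting with k = 1.
    laws = [[1.0 if i == x else 0.0 for i in range(n)] for x in range(n)]
    out = []
    for _ in range(steps):
        out.append(max(
            tv([nu[i] * q[i][j] for i in range(n) for j in range(n)], pi)
            for nu in laws))
        laws = [step(nu, q) for nu in laws]
    return out

# Hypothetical ergodic chain with strictly positive transition probabilities.
q = [[0.2, 0.5, 0.3],
     [0.4, 0.2, 0.4],
     [0.3, 0.3, 0.4]]
d = pair_tv_distances(q, 30)
```

As in the proof, the distance for the pair $(X_{k-1}, X_k)$ coincides with the distance of the law of $X_{k-1}$ from $q$, so the sequence `d` is nonincreasing.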

4.5. Remark. Note that a boundedness assumption is made in Theorem 4.3. In practice, this is not an
unreasonable assumption, since float sizes are bounded in many programming languages. The
boundedness may be enforced by a projection step in the algorithm, leading to a slightly more complex
formulation of Theorem 4.3. In particular, the resulting ODE becomes a projected ODE. See [8, Section
4.3].



4.6. Example: A stochastic gradient algorithm. Suppose we wish to minimize a function $h : \mathbb{R}^n \to \mathbb{R}$
with bounded first derivatives, but that we do not have full access to its gradient $\nabla h$. Instead, the
observations we make are determined by an underlying Markov chain $(x_k)$ on the state space $\{1, \ldots, n\}$
with aperiodic, irreducible transition probabilities $q_{ij}$. In case a jump is made to $j$, we observe $\frac{\partial h}{\partial \theta_j}(\theta)$
for some $\theta \in \mathbb{R}^n$. Is there a stochastic approximation algorithm that can minimize $h$ under these
restrictive conditions?

Consider Algorithm 3. We use $e_i$ to denote the unit vector in direction $i$.



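Algorithm 3 Stochastic gradient algorithm

  $x_0 \leftarrow$ any state in $S$
  $\theta_0 \leftarrow$ any initial vector in $\mathbb{R}^n$
  for $k = 1$ to $\infty$ do
    $x_k \leftarrow$ independent draw from $q(\cdot|x_{k-1})$
    $\theta_k \leftarrow \theta_{k-1} - \gamma_k \frac{\partial h}{\partial \theta_{x_k}}(\theta_{k-1}) \, e_{x_{k-1}}$
  end for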



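Since $\nabla h$ is bounded, the trajectories of this algorithm are restricted to the bounded set

$K = \{\theta \in \mathbb{R}^n : |\theta| \le \max(|\theta_0|, \ldots$
with probability one. The corresponding ODE (in the sense of Theorem 4.3) is (7) with

\[
  g_i(\theta) = -\sum_{j=1}^n q_i q_{ij} \frac{\partial h}{\partial \theta_j}(\theta), \qquad i = 1, \ldots, n.
  \tag{9}
\]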

Let $R = [r_{ij}]$ be the matrix defined by $r_{ij} = q_i q_{ij}$, $i, j = 1, \ldots, n$. We may then write

\[
  g(\theta) = -R \nabla h(\theta).
  \tag{10}
\]

Clearly the minimum $\theta^*$ of $h$, where $\nabla h(\theta^*) = 0$, gives an equilibrium point of this ODE. It is not
immediately clear whether this is the only equilibrium point.

We will now make the following assumptions:

(11) The matrix $R$ given by the entries $r_{ij} := q_i q_{ij}$ is symmetric, positive definite.

(12) The function $h$ is twice differentiable and strictly convex.

Sufficient conditions for (11) to hold are the following:

(i) The Markov chain given by $q_{ij}$ satisfies detailed balance, i.e. $q_i q_{ij} = q_j q_{ji}$ for $i, j = 1, \ldots, n$;

(ii) The Markov chain given by $q_{ij}$ is strictly lazy, i.e. $q_{ii} > \frac{1}{2}$ for $i = 1, \ldots, n$.

Indeed, if these conditions are satisfied, then $R$ is symmetric by the detailed balance condition. Since
the Markov chain is lazy, $R$ is strictly row diagonally dominant, so that all its eigenvalues are positive.
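
The two sufficient conditions can be checked mechanically for a given chain. The following sketch does so for a hypothetical lazy birth-death chain on three states (birth-death chains satisfy detailed balance); the function names and the data are illustrative assumptions, and the invariant distribution is approximated by long iteration.

```python
def stationary(q, iters=10000):
    """Approximate the invariant distribution by repeated multiplication."""
    n = len(q)
    mu = [1.0 / n] * n
    for _ in range(iters):
        mu = [sum(mu[i] * q[i][j] for i in range(n)) for j in range(n)]
    return mu

def check_R(q, tol=1e-12):
    """Return (R, symmetric, dominant) for the matrix with entries r_ij = q_i q_ij."""
    n = len(q)
    inv = stationary(q)
    R = [[inv[i] * q[i][j] for j in range(n)] for i in range(n)]
    symmetric = all(abs(R[i][j] - R[j][i]) < tol
                    for i in range(n) for j in range(n))
    # Strict row diagonal dominance: r_ii > sum of the off-diagonal entries in row i.
    dominant = all(R[i][i] > sum(R[i][j] for j in range(n) if j != i)
                   for i in range(n))
    return R, symmetric, dominant

# Hypothetical lazy birth-death chain: q_ii > 1/2 for every state.
q = [[0.7, 0.3, 0.0],
     [0.2, 0.6, 0.2],
     [0.0, 0.3, 0.7]]
R, symmetric, dominant = check_R(q)
```

A symmetric, strictly row diagonally dominant matrix with positive diagonal is positive definite, which is the property required in (11).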

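Under Assumption (11), define a Lyapunov function $V : \mathbb{R}^n \to \mathbb{R}$ by

\[
  V(\theta) = \tfrac{1}{2} \langle \nabla h(\theta), R \nabla h(\theta) \rangle,
\]

where $\langle \cdot, \cdot \rangle$ denotes the Euclidean inner product on $\mathbb{R}^n$.

Write $H(\theta) = \left[ \frac{\partial^2 h}{\partial \theta_i \partial \theta_j}(\theta) \right]_{i,j = 1, \ldots, n}$ to denote the Hessian matrix of $h$ at $\theta$, and note that, since $h$ is strictly
convex, the matrix $H(\theta)$ is positive definite for all $\theta \in \mathbb{R}^n$. Then if $\theta(t)$ satisfies the ODE (7) with $g$ given by (10),

\[
  \frac{d}{dt} V(\theta(t)) = \langle \nabla h(\theta(t)), R H(\theta(t)) \dot{\theta}(t) \rangle
  = -\langle \nabla h(\theta(t)), R H(\theta(t)) R \nabla h(\theta(t)) \rangle \le 0,
\]

with strict inequality if $\nabla h(\theta) \neq 0$. This shows that the ODE (10) is globally asymptotically stable with
unique equilibrium $\theta^*$ satisfying $\nabla h(\theta^*) = 0$. By Theorem 4.3 (iii), Algorithm 3 therefore converges
almost surely to $\theta^*$.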

In this case we are in some sense lucky to be able to find a Lyapunov function to establish global stability
of the ODE. In the case of Algorithm 1 (KL-learning) we have not yet found a global Lyapunov function
and so far can only establish local stability around the equilibrium in certain cases. For illustrative
purposes, we now also perform such a local analysis for the current example.

It is immediately clear that under Assumption (11), the only equilibrium of the ODE (10) satisfies $\nabla h(\theta) = 0$.
It remains to establish the stability of the ODE around that equilibrium point. The linearized version
of the ODE around the equilibrium is

\[
  \dot{\theta}(t) = -R H(\theta^*) \theta.
  \tag{13}
\]

We therefore need to determine the spectrum of the matrix $R H(\theta^*)$. Indeed $R H(\theta^*) R + R H(\theta^*) R$ is
positive definite, so that by Lyapunov's theorem [6, Theorem 2.2.1] $R H(\theta^*)$ has only eigenvalues in the
open right halfplane. We may conclude from this local analysis, by the Hartman-Grobman theorem [11,
Section 2.8], that the equilibrium $\theta^*$ is locally asymptotically stable.

4.7. Remark. Under the assumption that we can only observe $\frac{\partial h}{\partial \theta_j}$ if we jump to state $j$, a simpler
algorithm would consist of the update rule $\theta_k \leftarrow \theta_{k-1} - \gamma_k \frac{\partial h}{\partial \theta_{x_k}}(\theta_{k-1}) \, e_{x_k}$, i.e. to update the $(x_k)$-th
component of $\theta$ instead of the $(x_{k-1})$-th component. In this case a Lyapunov function would be given by
$V(\theta) = \frac{1}{2} \sum_{i=1}^n q_i \left( \frac{\partial h}{\partial \theta_i}(\theta) \right)^2$ and Assumption (11) would not be required. However, the analysis of
Algorithm 3 has more in common with the upcoming analysis of Algorithm 1 (KL-learning), because in that
algorithm the updates also depend upon the previous and current state of the Markov chain.

4.8. Example: Z-learning. In [14], the Z-learning algorithm is presented as a way to solve the
eigenvector problem $H z^* = z^*$, where $H = [h_{ij}]$ is a nonnegative irreducible matrix with spectral radius $\rho(H) = 1$
of the form $h_{ij} = \exp(-\beta c_{ij}) q_{ij}$ as in Section 2, with $[q_{ij}]$ the transition probabilities of some irreducible
Markov chain on $S = \{1, \ldots, n\}$. This problem is an important special case of the problem we address in
this paper, namely solving $H z^* = \lambda^* z^*$ with unknown spectral radius $\rho(H) = \lambda^*$.

The Z-learning algorithm is given by Algorithm 4.

Algorithm 4 Z-learning

  $z_0 \leftarrow \mathbb{1}$, $x_0 \leftarrow$ any state in $S$
  for $k = 1$ to $\infty$ do
    $x_k \leftarrow$ independent draw from $q(\cdot|x_{k-1})$
    $z_k \leftarrow z_{k-1} + \gamma_k \left( \exp(-\beta c(x_k|x_{k-1})) z_{k-1}(x_k) - z_{k-1}(x_{k-1}) \right) e_{x_{k-1}}$
  end for



The corresponding ODE (in the sense of Theorem 4.3) is given by

\[
  \dot{z}(t) = -D(I - H) z(t),
  \tag{14}
\]

where $D$ is a diagonal matrix given by $d_{ii} = q_i$, where $(q_i)$ denotes the invariant probability distribution
of the Markov chain given by $[q_{ij}]$.

It is immediate from the Perron-Frobenius theorem that the eigenvalues of the matrix $I - H$ are
contained in the closed right halfplane, with a one-dimensional eigenspace corresponding to the zero
eigenvalue and all other eigenvalues having strictly positive real part. This still holds for a multiplication
of $I - H$ by an arbitrary diagonal matrix with positive diagonal entries, but this is less immediate (see
e.g. [6, Exercise 2.5.2] for the nonsingular case). Therefore the linear subspace spanned by $z^*$ is globally
attracting, and the Z-learning algorithm converges to this subspace by Theorem 4.3 (iii).
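
Algorithm 4 can be sketched in Python as follows. The step sizes $\gamma_k = k^{-3/4}$, the seeded generator and the two-state example are assumptions made for illustration; the example reuses the rank-one setting of Example 2.3 with state costs chosen so that $\lambda^* = 1$ and $z^* \propto (3, 1)$.

```python
import math
import random

def z_learning(q, c, beta, n_iter, seed=0):
    """Sketch of Algorithm 4 with assumed step sizes gamma_k = k**(-0.75)."""
    rng = random.Random(seed)
    n = len(q)
    z = [1.0] * n              # z_0 <- vector of ones
    x = 0                      # any initial state
    for k in range(1, n_iter + 1):
        gamma = k ** -0.75
        y = rng.choices(range(n), weights=q[x])[0]   # draw x_k from q(.|x_{k-1})
        z[x] += gamma * (math.exp(-beta * c[x][y]) * z[y] - z[x])
        x = y
    return z

# Hypothetical two-state example with spectral radius 1:
# exp(-beta c(0)) = 1.5, exp(-beta c(1)) = 0.5, q(j|i) = 1/2.
beta = 1.0
q = [[0.5, 0.5], [0.5, 0.5]]
c0, c1 = -math.log(1.5), math.log(2.0)
c = [[c0, c0], [c1, c1]]
H = [[math.exp(-beta * c[i][j]) * q[i][j] for j in range(2)] for i in range(2)]
z_star = [3.0, 1.0]            # satisfies H z_star = z_star
z = z_learning(q, c, beta, 5000)
```

Only the direction of the returned vector is meaningful, since the ODE (14) attracts to the whole line spanned by $z^*$; how closely the ratio `z[0] / z[1]` approaches 3 depends on the step sizes and the run length.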

4.9. D-stability as a necessary condition for convergence of stochastic approximation algorithms. 

In the previous example, the positive stability of $I - H$ carried over to a multiplication by a positive
diagonal matrix, $D(I - H)$, irrespective of the kind of diagonal matrix $D$. This kind of stability (invariant
under left- (or right-) multiplication by an arbitrary positive diagonal matrix) is called D-stability in the
literature (see e.g. [6, Section 2.5]), and the major difficulty with establishing local stability of the
KL-learning algorithm consists of showing D-stability for the corresponding linearized ODE.

5. Theoretical analysis of KL-learning 

The KL-learning algorithm (Algorithm 1) works well in practice, but for a rigorous theoretical analysis
of its behaviour we need to make a few modifications, as given by Algorithm 5.

The modifications of Algorithm 5 with respect to Algorithm 1 are:

(i) The values of $z$, $\lambda$ and $\Delta$ are indexed by the time parameter $k$ to keep track of all values;

(ii) Instead of a single step size $\gamma > 0$ and a finite time horizon $M \in \mathbb{N}$ we consider an infinite time
horizon and a decreasing sequence of step sizes $(\gamma_k)$;

(iii) At every iteration, if necessary, a projection is performed (in the computation of $\Delta_k$) to ensure
that $\lambda_k \ge \lambda_{\min} := \min_{i,j \in \{1, \ldots, n\}} \exp(-\beta c(j|i))/2$.

Modification (i) is purely a notational matter. Modification (ii) is standard in the analysis of stochastic
approximation algorithms. If we kept the step size constant, the theoretical analysis would be
harder. The practical effect of keeping the step size fixed is that the values of $(z_k, \lambda_k)$ will oscillate around
the theoretical solution $(z^*, \lambda^*)$ with a bandwidth depending on $\gamma$. Modification (iii) has minimal
practical effect; we have not seen cases in which the projection step was actually made. The theoretical
solution $\lambda^*$ satisfies $\lambda^* \ge 2\lambda_{\min}$ by theory on nonnegative matrices [5, Corollary 8.1.19]. The constant 2
is arbitrary, chosen to ensure that $\lambda^*$ lies well above $\lambda_{\min}$. By Lemma 5.3 $\lambda_k$ is bounded from above, so
there is no need to prevent $\lambda_k$ from growing large.

In this section we will write $\mathbb{R}^n_+ := \{x \in \mathbb{R}^n : x_i \ge 0 \text{ for } i = 1, \ldots, n\}$.

In the discussion below we will only refer to Algorithm 5.

The initial value for $\lambda_0$ is more or less arbitrary, but it is important that $\|z_0\|_1 = \lambda_0$ and $(z_0, \lambda_0) \in K$ with $K$ given
by Lemma 5.3. We impose the following conditions.

Algorithm 5 KL-learning, notation for analysis

  $\lambda_0 \leftarrow 2\lambda_{\min} = \min_{i,j = 1, \ldots, n} \exp(-\beta c(j|i))$
  $z_0(i) \leftarrow \frac{1}{n} \lambda_0$, for all $i = 1, \ldots, n$
  $x_0 \leftarrow$ any state in $S$
  for $k = 1$ to $\infty$ do
    $x_k \leftarrow$ independent draw from $q(\cdot|x_{k-1})$
    $\Delta_k \leftarrow \max\left\{ \exp(-\beta c(x_k|x_{k-1})) z_{k-1}(x_k)/\lambda_{k-1} - z_{k-1}(x_{k-1}), \; (\lambda_{\min} - \lambda_{k-1})/\gamma_k \right\}$
    $z_k(x_{k-1}) \leftarrow z_{k-1}(x_{k-1}) + \gamma_k \Delta_k$
    $\lambda_k \leftarrow \lambda_{k-1} + \gamma_k \Delta_k$
  end for
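
The projected variant can be sketched in the same way as Algorithm 1. The code below is an illustrative sketch, not the authors' implementation: the step sizes $\gamma_k = 1/k$, the function name `kl_learning_projected`, the seeded generator and the example data are assumptions.

```python
import math
import random

def kl_learning_projected(q, c, beta, n_iter, seed=0):
    """Sketch of Algorithm 5 with assumed step sizes gamma_k = 1/k."""
    rng = random.Random(seed)
    n = len(q)
    weights = [math.exp(-beta * c[i][j]) for i in range(n) for j in range(n)]
    lam_min = min(weights) / 2.0
    M_bound = max(weights)     # the constant M of the invariant set
    lam = 2.0 * lam_min        # lambda_0
    z = [lam / n] * n          # z_0(i) = lambda_0 / n, so ||z_0||_1 = lambda_0
    x = 0                      # any initial state
    for k in range(1, n_iter + 1):
        gamma = 1.0 / k
        y = rng.choices(range(n), weights=q[x])[0]
        # The second argument of max is the projection keeping lambda_k >= lambda_min.
        delta = max(math.exp(-beta * c[x][y]) * z[y] / lam - z[x],
                    (lam_min - lam) / gamma)
        z[x] += gamma * delta
        lam += gamma * delta
        x = y
    return z, lam, lam_min, M_bound

# Hypothetical example data.
q = [[0.2, 0.5, 0.3],
     [0.4, 0.2, 0.4],
     [0.3, 0.3, 0.4]]
c = [[0.0, 1.0, 2.0],
     [1.0, 0.0, 1.0],
     [2.0, 1.0, 0.0]]
z, lam, lam_min, M_bound = kl_learning_projected(q, c, beta=1.0, n_iter=20000)
```

The returned values can be compared against the invariants stated in Lemma 5.3: nonnegative components bounded by $M$, $\lambda_k \ge \lambda_{\min}$, and $\|z_k\|_1 = \lambda_k$.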



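5.1. Hypothesis.

(i) Let $(\gamma_k)_{k \in \mathbb{N}}$ be a sequence of nonnegative real numbers such that $\sum_{k=1}^\infty \gamma_k = \infty$ and $\sum_{k=1}^\infty \gamma_k^2 < \infty$;

(ii) Let $q(\cdot|\cdot)$ be irreducible aperiodic Markov transition probabilities on the finite state space
$S = \{1, \ldots, n\}$ with invariant probabilities $q_i$, $i \in S$;

(iii) Let $c \in \mathbb{R}^{n \times n}$;

(iv) Let the matrix $H = [h_{ij}] \in \mathbb{R}^{n \times n}$ be given by $h_{ij} = \exp(-\beta c_{ij}) q_{ij}$ and the diagonal matrix
$D = \mathrm{diag}(q_1, \ldots, q_n)$;

(v) Define continuous time processes $(z^k(t))_{t \ge 0}$ and $(\lambda^k(t))_{t \ge 0}$ for $k = 0, 1, \ldots$ as in Hypotheses 4.2
(vi).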

5.2. Theorem (Convergence of KL-learning). Consider Algorithm 5 under the conditions of
Hypothesis 5.1. Then

(a) With full probability, for any sequence of processes $(z^k, \lambda^k)$ there exists a subsequence converging
uniformly on bounded intervals to some continuous functions $(z, \lambda)$, $z : [0, \infty) \to \mathbb{R}^n$ and
$\lambda : [0, \infty) \to (0, \infty)$.

(b) The trajectories of Algorithm 5, as well as the limiting functions $(z, \lambda)$ given by (a), are
constrained to a closed, bounded set $K$ given by (17).

(c) Such a limit $(z, \lambda)$ satisfies the ODE

\[
  \begin{cases}
    \dot{z}(t) = f(z(t), \lambda(t)) + w, \\
    \dot{\lambda}(t) = h(z(t), \lambda(t)) + \mu,
  \end{cases}
  \qquad t \ge 0,
  \tag{15}
\]

with $f : \mathbb{R}^n_+ \times (0, \infty) \to \mathbb{R}^n$ and $h : \mathbb{R}^n_+ \times (0, \infty) \to \mathbb{R}$ given by

\[
  f(z, \lambda) := D \left( \tfrac{1}{\lambda} H - I \right) z, \qquad h(z, \lambda) := \mathbb{1}^T f(z, \lambda),
  \tag{16}
\]

where

\[
  D = \mathrm{diag}(q(1), \ldots, q(n)),
\]

with $q$ the unique invariant probability distribution for the Markov chain with transition
probabilities $q_{ij}$. Here $w \in \mathbb{R}^n$ and $\mu \ge 0$ denote the minimum force necessary to keep $(z(t), \lambda(t))$
in $K$ (the continuous time equivalent of the projection step to ensure that $\lambda_k \ge \lambda_{\min}$; see [8,
Section 4.3]).

(d) The ODE (15) admits a unique equilibrium $(z^*, \lambda^*)$ in the interior of $K$, where $H z^* = \lambda^* z^*$ and
$\|z^*\|_1 = \lambda^*$.

(e) If any of the conditions of Proposition 5.10 hold, then the equilibrium $(z^*, \lambda^*)$ as mentioned
under (d) is locally asymptotically stable.

Proof. By Lemma 5.3 the trajectories of Algorithm 5 are constrained to the compact set $K$ given by (17).
We may apply a variant of Theorem 4.3 suitable for projected algorithms (see [8, Theorem 6.6.1]) to deduce
(a), (b) and (c), where for (c) we may use Lemma 5.4. Results (d) and (e) follow from Proposition 5.5,
Remark 5.6, Proposition 5.8 and Proposition 5.10, where we note that no projection force is necessary in the interior of $K$. □

5.3. Lemma (algorithm invariants). Under Hypothesis 5.1 the trajectories $(z_k, \lambda_k)$ of Algorithm 5 are
contained in the compact set

\[
  K := \{(z, \lambda) \in [0, M]^n \times [\lambda_{\min}, nM] : \|z\|_1 = \lambda\},
  \tag{17}
\]

with $M := \max_{i,j = 1, \ldots, n} \exp(-\beta c_{ij})$.

Proof. Note that, by the maximum operation in the algorithm, Afc > exp{- pc{X]c\xjc-i)) zjc-iixjc)- Zk-iixk-i)) . 
The update for zjc therefore satisfies 

(18) zjcixjc-i) > {l-Yk)Zk-i{Xk-i)+rkexp{-pc{Xk\Xk-i))Zk-i{Xk)/Ak-i>0 

It follows immediately by induction that $z_k(i) \ge 0$ for $i = 1,\dots,n$ and $k = 1, 2, \dots$. The update rule $\lambda_k = \lambda_{k-1} + \gamma_k \Delta_k$ for $\lambda$ ensures that $\lambda_k = \|z_k\|_1 \ge \lambda_{\min}$ for all $k = 1, 2, \dots$. We will show by induction that $z_k(i) \le M$ for all $k \in \mathbb{N}$ and $i = 1,\dots,n$. Recall that $z_0(i) \le M$ for $i = 1,\dots,n$. Suppose $z_{k-1}(i) \le M$ for all $i$, and some $k \in \mathbb{N}$. If no projection occurs,

$z_k(x_{k-1}) \le (1-\gamma_k)\, z_{k-1}(x_{k-1}) + \gamma_k \max_{i,j=1,\dots,n} \exp(-\beta c_{ij}) \le M,$

where we used that $z_{k-1}(i) \le \lambda_{k-1}$ for all $i$. If projection does occur,

$z_k(x_{k-1}) \le \lambda_k = \lambda_{k-1} + \gamma_k (\lambda_{\min} - \lambda_{k-1})/\gamma_k = \lambda_{\min} \le M.$

□ 
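
The invariants of the lemma can be checked numerically. The following is a minimal sketch, assuming the update reads as in the proof above: the visited component of $z$ and $\lambda$ both move by $\gamma_k \Delta_k$, where $\Delta_k$ is the maximum of the unprojected increment and $(\lambda_{\min} - \lambda_{k-1})/\gamma_k$. The three-state chain, the costs, the constant gain and every name in the code are illustrative assumptions and are not taken from the paper.

```python
import math
import random


def simulate_invariants(steps=5000, gamma=0.05, beta=1.0, lam_min=0.05, seed=0):
    """Run the assumed projected update and record the extreme values visited."""
    rng = random.Random(seed)
    # Hypothetical 3-state chain: uncontrolled probabilities q and costs c.
    q = [[0.5, 0.5, 0.0], [0.25, 0.5, 0.25], [0.0, 0.5, 0.5]]
    c = [[0.0, 1.0, 2.0], [1.0, 0.5, 1.0], [2.0, 1.0, 0.1]]
    n = len(q)
    M = max(math.exp(-beta * c[i][j]) for i in range(n) for j in range(n))
    z = [M / n] * n          # start with 0 <= z_0(i) <= M
    lam = sum(z)             # lambda_0 = ||z_0||_1
    x = 0
    stats = {"z_low": min(z), "z_high": max(z), "lam_low": lam, "norm_gap": 0.0}
    for _ in range(steps):
        x_next = rng.choices(range(n), weights=q[x])[0]
        # Increment without projection, as read off from (18).
        raw = math.exp(-beta * c[x][x_next]) * z[x_next] / lam - z[x]
        # The maximum keeps lambda at or above lam_min (projection step).
        delta = max(raw, (lam_min - lam) / gamma)
        z[x] += gamma * delta
        lam += gamma * delta
        x = x_next
        stats["z_low"] = min(stats["z_low"], min(z))
        stats["z_high"] = max(stats["z_high"], max(z))
        stats["lam_low"] = min(stats["lam_low"], lam)
        stats["norm_gap"] = max(stats["norm_gap"], abs(lam - sum(z)))
    return M, stats
```

Calling `simulate_invariants()` returns the bound $M$ together with the smallest and largest component of $z$ seen, the smallest $\lambda$ seen, and the largest deviation between $\lambda$ and $\|z\|_1$; up to rounding these should respect the set $K$ in (17).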

5.4. Lemma. The function $g$ corresponding to Algorithm 5 in the sense of Theorem 4.3 is given by

$g(z,\lambda) = \begin{pmatrix} f(z,\lambda) \\ h(z,\lambda) \end{pmatrix},$

where $f$ and $h$ are given by (16).

Proof. A straightforward computation. □ 

5.5. Proposition. Suppose $D$ is a diagonal matrix with positive entries and $H$ a nonnegative matrix. Suppose $f : \mathbb{R}^n \times \mathbb{R} \to \mathbb{R}^n$ and $h : \mathbb{R}^n \times \mathbb{R} \to \mathbb{R}$ are given by (16).

Consider the ODE (15) with initial values $(z(0), \lambda(0))$ such that $z(0) \ge 0$ and $\lambda(0) = \|z(0)\|_1$.

The orbits $(z(t),\lambda(t))$ satisfy $z(t) \ge 0$ and $\lambda(t) = \|z(t)\|_1 > 0$ for all $t \ge 0$. Furthermore a point $(z,\lambda) \in \mathbb{R}^n \times \mathbb{R}$ is an equilibrium point if and only if $Hz = \lambda z$.

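Proof. Suppose that for some $t \ge 0$, we have $z(t) \ge 0$, $\lambda(t) > 0$. If for some component $z(t)(k)$ of $z(t)$ we have $z(t)(k) = 0$, then

$[f(z(t),\lambda(t))](k) = \left[ D\left(\tfrac{1}{\lambda(t)} H - I\right) z(t) \right](k) = \tfrac{1}{\lambda(t)} [D H z(t)](k) \ge 0.$

Also, as $\lambda \downarrow 0$, then

$h(z,\lambda) = \mathbf{1}^T D\left(\tfrac{1}{\lambda} H - I\right) z \to \infty,$

so that $\lambda$ always remains positive. For $z(t) \ge 0$,

$\frac{d}{dt} \|z(t)\|_1 = \sum_{k=1}^n \dot z(t)(k) = \mathbf{1}^T \dot z(t) = h(z(t),\lambda(t)) = \dot\lambda(t).$

Because $z(0)(k) \ge 0$ for all $k = 1,\dots,n$ and $\lambda(0) = \|z(0)\|_1$, it follows that $z(t) \ge 0$ for all $t$ and $\|z(t)\|_1 = \lambda(t)$. It is straightforward that $Hz = \lambda z$ if and only if $(z,\lambda)$ is an equilibrium point. □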

5.6. Remark. In light of the proposition above, we may consider the dynamical system

(19) $\begin{cases} \dot z(t) = D\left(\tfrac{1}{\|z(t)\|_1} H - I\right) z(t), \\ z(0) = z_0, \end{cases}$

with $z_0(k) \ge 0$ for all $k = 1,\dots,n$, instead of (15), thus reducing the dimensionality from $\mathbb{R}^{n+1}$ to $\mathbb{R}^n$.
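
To illustrate the reduced system, the following sketch integrates $\dot z = D(H/\|z\|_1 - I)z$ with the explicit Euler method. The matrix `H`, the diagonal entries `d`, the step size and the function name are hypothetical choices made only for this example; the scheme is a plain numerical illustration and not part of the algorithm analysed in the paper.

```python
def integrate_reduced_ode(H, d, z0, dt=0.01, steps=20000):
    """Explicit Euler for dz/dt = D (H / ||z||_1 - I) z with D = diag(d)."""
    n = len(z0)
    z = list(z0)
    for _ in range(steps):
        s = sum(z)  # equals ||z||_1 because z stays componentwise positive
        Hz = [sum(H[i][j] * z[j] for j in range(n)) for i in range(n)]
        z = [z[i] + dt * d[i] * (Hz[i] / s - z[i]) for i in range(n)]
    return z


# Hypothetical data: a positive 2x2 matrix H and positive diagonal entries d.
H = [[1.0, 2.0], [3.0, 1.0]]
d = [0.3, 0.7]
z = integrate_reduced_ode(H, d, z0=[1.0, 1.0])
lam = sum(z)
# Distance from the eigenvector equation H z = lam z at the final iterate.
residual = max(
    abs(sum(H[i][j] * z[j] for j in range(2)) - lam * z[i]) for i in range(2)
)
```

For this particular `H` the spectral radius is $1 + \sqrt{6}$, so `lam` should settle near that value and `residual` should be close to zero, in line with the equilibrium characterisation $Hz = \lambda z$, $\|z\|_1 = \lambda$.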
5.7. Definition. A matrix $A \in \mathbb{R}^{n \times n}$ is called stable (or strictly stable) if all eigenvalues $\lambda \in \sigma(A)$ satisfy $\operatorname{Re} \lambda \le 0$ (or $\operatorname{Re} \lambda < 0$, respectively).

5.8. Proposition. Suppose $H$ is a nonnegative irreducible matrix and $D$ is a diagonal matrix with positive entries on the diagonal.

Then (19) has a unique equilibrium point $z^*$. This equilibrium point satisfies

(i) $\|z^*\|_1 = \lambda^* := \rho(H)$,

(ii) $z^* > 0$, and

(iii) $Hz^* = \lambda^* z^*$.

The equilibrium is locally (asymptotically) stable if and only if the matrix $D(H - \lambda^* I - z^* \mathbf{1}^T)$ is (strictly) stable.

Proof. By Remark 5.6 above, we may apply Proposition 5.5 to conclude that $z^*$ satisfies (iii) for some $\lambda^* > 0$, and $z^* \ge 0$, $z^* \ne 0$. Since $H$ is nonnegative and irreducible, there is, up to scaling by a positive constant, only a single eigenvector with nonnegative components. This is the Perron vector whose eigenvalue satisfies $\lambda^* = \rho(H)$, and which has only positive components, so that (i) and (ii) follow.

The linearization of $z \mapsto D\left(\tfrac{1}{\|z\|_1} H - I\right) z$ around $z$ is given by

$v \mapsto D\left(\tfrac{1}{\|z\|_1} H - I\right) v - \tfrac{1}{\|z\|_1^2} D H z \mathbf{1}^T v,$

which reduces to

$v \mapsto D\left(\tfrac{1}{\lambda^*} (H - z^* \mathbf{1}^T) - I\right) v$

for $z = z^*$. Multiplication by $\lambda^*$ does not affect the stability properties of this matrix, so the stability of the equilibrium $z^*$ is determined by the spectrum of the matrix $D(H - z^* \mathbf{1}^T - \lambda^* I)$. □

The stability of the matrix $D(H - \lambda^* I - z^* \mathbf{1}^T)$ seems to be a non-trivial issue. In Proposition 5.10 below we collect some facts that we have already obtained. For this we need a lemma.



5.9. Lemma. Suppose $H$ and $D$ satisfy the conditions of Proposition 5.8. Then the matrix $D(H - \lambda^* I - z^* \mathbf{1}^T)$, with $\lambda^* = \rho(H)$ and $z^*$ the corresponding positive eigenvector, is nonsingular.

Proof. We will omit *-superscripts in this proof, so $z = z^*$ and $\lambda = \lambda^*$. Write $A = H - \lambda I - z \mathbf{1}^T$.

Let $w$ and $z$ denote the left and right Perron-Frobenius eigenvectors of $H$. Note that $\langle z, w \rangle > 0$ and $\langle \mathbf{1}, z \rangle > 0$. Also note that any $\xi \in \mathbb{R}^n$ may be written as $\xi = \alpha z + \eta$, with $\eta \perp \mathbf{1}$, by picking $\alpha = \langle \xi, \mathbf{1} \rangle / \langle z, \mathbf{1} \rangle$ and $\eta = \xi - \alpha z$. Therefore we may choose a basis of $\mathbb{R}^n$ consisting of the vector $v_1 = z$ and some vectors $v_2,\dots,v_n$ spanning $\mathbf{1}^\perp$. Let $S$ denote the matrix with columns $v_1,\dots,v_n$. Then the first column of $S^{-1}(H - \lambda I)S$ consists of zeroes since $(H - \lambda I) z = 0$, and only the first column of $S^{-1} z \mathbf{1}^T S$ is nonzero. So adding $S^{-1} z \mathbf{1}^T S$ to $S^{-1}(H - \lambda I)S$ only increases the range of the resulting matrix. Therefore, $\operatorname{range}(H - \lambda I) \subset \operatorname{range}(A)$.

Since $w^T (H - \lambda I) = 0^T$, we have that $w$ is perpendicular to the range of $H - \lambda I$. But $w$ is not perpendicular to the range of $A$ since $\langle w, z \rangle > 0$. In other words, the inclusion $\operatorname{range}(H - \lambda I) \subset \operatorname{range}(A)$ is strict, $\operatorname{rank}(A) > \operatorname{rank}(H - \lambda I) = n - 1$, so that $\operatorname{rank}(A) = n$ and $\det(A) \ne 0$. Hence also $\det(DA) \ne 0$. □



5.10. Proposition. Suppose $H$ and $D$ satisfy the conditions of Proposition 5.8. The matrix $D(H - \lambda^* I - z^* \mathbf{1}^T)$, with $\lambda^* = \rho(H)$ and $z^*$ the corresponding positive eigenvector, is strictly stable in any of the following cases:

(i) $D = \delta I$ for some $\delta > 0$,

(ii) $\mathbf{1}^T H = \lambda^* \mathbf{1}^T$ (so $\mathbf{1}^T$ is a left Perron vector),

(iii) $H \in \mathbb{R}^{2 \times 2}$.
Proof. As before, we will omit *-superscripts in this proof, so $z = z^*$ and $\lambda = \lambda^*$.

(i) Suppose $v$ is an eigenvector of $A = H - \lambda I - z \mathbf{1}^T$ with eigenvalue $\mu \ne 0$. Define $w = v + \left(\tfrac{\mathbf{1}^T v}{\mu}\right) z$. Then

$(H - \lambda I) w = (H - \lambda I) v = \mu v + z \mathbf{1}^T v = \mu w,$

which shows that $\mu \in \sigma(H - \lambda I)$, and since $\mu \ne 0$, it follows that $\operatorname{Re} \mu < 0$. So all $\mu \in \sigma(A)$ have $\operatorname{Re} \mu < 0$, except possibly the case where $\mu = 0$, but this case is excluded by Lemma 5.9 above.

(ii) Let $B = -(H - \lambda I)\operatorname{diag}(z)$. Then $B$ has a positive diagonal and nonpositive off-diagonal entries. Furthermore $B \mathbf{1} = 0$ and $\mathbf{1}^T B = 0$. This shows that $B$ is row diagonally dominant and column diagonally dominant, so that the same holds for $B + B^T$. It follows that $B + B^T$ is positive semidefinite. Note that $B + B^T$ is a singular M-matrix (see [6, Section 2.5.5]). Since $H$ is irreducible, also $B + B^T$ is irreducible. Therefore the nullspace of $B + B^T$ is one-dimensional and it is spanned by $\mathbf{1}$. Also $z \mathbf{1}^T \operatorname{diag}(z) = z z^T$ is symmetric positive semidefinite, and $\langle \mathbf{1}, z z^T \mathbf{1} \rangle > 0$. It follows that $B + B^T + 2 z z^T$ is positive definite. Multiplying on both sides by $D$ (a congruence transform) gives that $-\left[ D(H - \lambda I - z \mathbf{1}^T)\operatorname{diag}(z) D + \operatorname{diag}(z) D (H - \lambda I - z \mathbf{1}^T)^T D \right]$ is symmetric positive definite. By Lyapunov's theorem [6, Theorem 2.2.1], it follows that $D(H - \lambda I - z \mathbf{1}^T)$ is strictly stable.

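(iii) By (i) we have that $A = H - \lambda I - z \mathbf{1}^T$ is strictly stable. In 2 dimensions, this is equivalent to $\det(A) > 0$ and $\operatorname{tr}(A) < 0$. This immediately implies that $\det(DA) = \det(D)\det(A) > 0$. We compute

$\operatorname{tr}(DA) = \operatorname{tr}\left( \begin{pmatrix} d_1 & 0 \\ 0 & d_2 \end{pmatrix} \begin{pmatrix} h_{11} - \lambda - z_1 & h_{12} - z_1 \\ h_{21} - z_2 & h_{22} - \lambda - z_2 \end{pmatrix} \right) = d_1 (h_{11} - \lambda - z_1) + d_2 (h_{22} - \lambda - z_2) < 0,$

since the diagonal of $H - \lambda I$ has nonpositive entries (which may be seen by [5, Theorem 8.3.2]).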

□ 
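
Case (iii) lends itself to a quick numerical sanity check. The sketch below, under the assumption of strictly positive $2 \times 2$ matrices $H$ drawn at random, computes $\lambda = \rho(H)$ and the Perron vector $z$ in closed form, forms $A = H - \lambda I - z\mathbf{1}^T$, and evaluates $\det(DA)$ and $\operatorname{tr}(DA)$. The sampling ranges, the seed and the function name are illustrative assumptions; a random check of this kind is of course no substitute for the proof.

```python
import math
import random


def check_case_2x2(H, d):
    """Return (det(DA), tr(DA)) for A = H - lam*I - z 1^T in two dimensions."""
    (h11, h12), (h21, h22) = H
    tr_H, det_H = h11 + h22, h11 * h22 - h12 * h21
    lam = 0.5 * (tr_H + math.sqrt(tr_H * tr_H - 4.0 * det_H))  # rho(H)
    # Right Perron vector: (h11 - lam) z1 + h12 z2 = 0, scaled so z1 + z2 = lam.
    v = (h12, lam - h11)
    scale = lam / (v[0] + v[1])
    z1, z2 = scale * v[0], scale * v[1]
    # Row i of z 1^T is (z_i, z_i).
    A = [[h11 - lam - z1, h12 - z1], [h21 - z2, h22 - lam - z2]]
    DA = [[d[0] * A[0][0], d[0] * A[0][1]], [d[1] * A[1][0], d[1] * A[1][1]]]
    det_DA = DA[0][0] * DA[1][1] - DA[0][1] * DA[1][0]
    tr_DA = DA[0][0] + DA[1][1]
    return det_DA, tr_DA


rng = random.Random(1)
results = []
for _ in range(1000):
    H = [[rng.uniform(0.1, 2.0) for _ in range(2)] for _ in range(2)]
    d = [rng.uniform(0.1, 2.0) for _ in range(2)]
    results.append(check_case_2x2(H, d))
# Strict stability in 2 dimensions: positive determinant and negative trace.
all_stable = all(det > 0 and tr < 0 for det, tr in results)
```

Here `all_stable` records whether every sampled pair $(H, D)$ gave $\det(DA) > 0$ and $\operatorname{tr}(DA) < 0$, which is the two-dimensional criterion used in the proof.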

We think the above proposition can be generalized significantly. In fact, we propose the following conjecture. If the conjecture holds, Theorem 5.2 (e) can be formulated unconditionally.



5.11. Conjecture. Suppose $H$ and $D$ satisfy the conditions of Proposition 5.8. Then the matrix $D(H - \lambda^* I - z^* \mathbf{1}^T)$ is strictly stable.



6. Numerical experiment 

Consider the example of a gridworld (Figure 1(a)), where some walls are present in a finite grid. Suppose the uncontrolled dynamics $q$ allow moving through the walls, but walking through a wall is very costly: a cost of 100 is incurred per step through a wall. Where there is no wall, a cost of 1 per step is incurred. There is a single state, in the bottom right, where no costs are incurred. The uncontrolled dynamics are such that with equal probability we may move left, right, up, down or stay where we are (but it is impossible to move out of the gridworld). The value function for this problem can be seen in Figure 1(b). In order to be able to compare our algorithm to the original Z-learning algorithm, the cost vector is normalized in such a way that $\lambda^* = 1$, so that Z-learning converges on the given input.
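
The setup described above can be sketched in code as follows. Several details are assumptions made for the sake of a runnable example and are not specified in the text: moves that would leave the grid are simply dropped and the remaining moves (including staying) share the probability equally, staying put costs the same as an ordinary step, every transition out of the bottom-right cell is free, and walls are given as a set of pairs of adjacent cells. The function and parameter names are hypothetical, and the normalization to $\lambda^* = 1$ is not included.

```python
def gridworld_dynamics(rows, cols, walls, wall_cost=100.0, step_cost=1.0):
    """Build uncontrolled probabilities q and transition costs c as nested dicts.

    `walls` is a set of frozensets {cell_a, cell_b}, one per wall segment
    separating two adjacent cells.
    """
    goal = (rows - 1, cols - 1)  # bottom-right cell, assumed to be cost free
    q, c = {}, {}
    for r in range(rows):
        for col in range(cols):
            cell = (r, col)
            moves = [cell]  # staying put is always allowed
            for dr, dc in ((0, -1), (0, 1), (-1, 0), (1, 0)):
                nxt = (r + dr, col + dc)
                if 0 <= nxt[0] < rows and 0 <= nxt[1] < cols:
                    moves.append(nxt)  # moves leaving the grid are dropped
            q[cell] = {m: 1.0 / len(moves) for m in moves}
            c[cell] = {}
            for m in moves:
                if cell == goal:
                    c[cell][m] = 0.0
                elif frozenset((cell, m)) in walls:
                    c[cell][m] = wall_cost
                else:
                    c[cell][m] = step_cost
    return q, c
```

For instance, `gridworld_dynamics(3, 3, {frozenset({(0, 0), (0, 1)})})` describes a 3 by 3 grid with a single wall between the two leftmost cells of the top row.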

The result of running the stochastic approximation algorithm with a constant gain of $\gamma = 0.05$ is portrayed in Figure 1(c), where it is compared to Z-learning (see Section liTsl and [14]). This result may also be compared to the use of the power method in Figure 1(d). Here the following version of the power method is used, in order to give a fair comparison with our stochastic method:

$z_k = z_{k-1} + \gamma_k (H z_{k-1} - z_{k-1}).$

Note that for each iteration of the power method, the number of operations is (for sparse $H$) proportional to the number of non-zero elements in $H$. In the stochastic method the number of operations per iteration is of order 1. Comparing the graphs in Figure 1(c) and (d), we see that KL-learning is comparable in speed of convergence to Z-learning as well as to the power method.
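
The damped power iteration used for comparison can be sketched as follows, with $H$ stored sparsely so that one iteration touches each non-zero entry once. The dictionary representation, the function name, the constant gain and the small example matrix (rescaled so that its Perron eigenvalue is 1) are assumptions chosen for illustration; this is not the gridworld experiment itself.

```python
def damped_power_method(H_sparse, z0, gamma=0.05, iterations=2000):
    """Iterate z_k = z_{k-1} + gamma * (H z_{k-1} - z_{k-1}) for sparse H.

    `H_sparse` maps an index pair (i, j) to the non-zero entry H[i][j].
    """
    z = list(z0)
    for _ in range(iterations):
        Hz = [0.0] * len(z)
        for (i, j), h in H_sparse.items():  # work proportional to the non-zeros
            Hz[i] += h * z[j]
        z = [zi + gamma * (hzi - zi) for zi, hzi in zip(z, Hz)]
    return z


# Hypothetical example, scaled so that the Perron eigenvalue equals 1.
rho = 1.0 + 6.0 ** 0.5
H_sparse = {
    (0, 0): 1.0 / rho, (0, 1): 2.0 / rho,
    (1, 0): 3.0 / rho, (1, 1): 1.0 / rho,
}
z = damped_power_method(H_sparse, [1.0, 1.0])
```

With a constant gain in $(0,1)$ the iteration matrix is $(1-\gamma)I + \gamma H$, so for an irreducible $H$ with $\rho(H) = 1$ the iterates should align with the Perron vector; in this example the ratio `z[1] / z[0]` should approach $\sqrt{6}/2$.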
Figure 1. Numerical experiment: (a) grid world; (b) value function; (c) comparison between Z-learning and KL-learning; (d) power method.



7. Discussion 



The strength of KL control is its very general applicability. The only requirements are the existence of 
some uncontrolled dynamics governed by a Markov chain, and some state or transition dependent cost. 
The Markov chain may actually be derived from a graph of allowed transitions, giving every allowed 
transition equal probability. A disadvantage is that we cannot directly influence the control cost; it is 
determined by the KL divergence. 

KL control is very useful if we know which moves (e.g. in a game) are allowed and we wish to find out which moves are best. The control cost of KL divergence form has a regularizing effect: no move will be made with probability one (unless it is the only allowed move). One could say that there is always a possibility to perform an exploratory move, instead of an exploiting move, under the controlled dynamics.

This immediately suggests the use of KL learning as a reinforcement learning algorithm. The initial transition probabilities represent exploratory dynamics. At every iteration, we could compute a new version of the optimal transition probabilities and use these as a new mixture of exploitation and exploration. The practical implications of this idea will be the topic of further research.

The KL learning algorithm seems to work well in practice and a basis has been provided for its theoretical analysis. Some questions remain to be answered. In particular, if Conjecture 5.11 is true, then regardless of the structure of the problem we know that the solution of the control problem is a locally asymptotically stable equilibrium of the algorithm. It would be even more convenient if a Lyapunov function for the ODE (15) could be found, which would imply global convergence of KL learning.

So far numerical results indicate that KL learning is a reliable algorithm. In the near future we will apply it to practical examples and evaluate its performance relative to other reinforcement learning algorithms.